distribution of PHI-BLAST alignment scores is studied analytically and empirically. In many cases, the program is able to detect statistically significant similarity between homologous proteins that are not recognizably related using traditional single-pass database search methods. PHI-BLAST is applied to the analysis of CED4-like cell death regulators, HS90-type ATPase domains, archaeal tRNA nucleotidyltransferases and archaeal homologs of DnaG-type DNA primases.

INTRODUCTION

In the analysis of a protein or DNA sequence, particular interest often focuses upon a small region or pattern. A natural question is whether there are other related sequences that share this pattern. The most widely used tools for sequence similarity search allow matching between arbitrary regions of the query and database sequences (1-5). In contrast, many motif-based search tools seek database sequences that match a pre-specified pattern (6-12). If this pattern is too short or not specified with sufficient precision, the number of matches may be very large, most being of no biological relevance. On the other hand, an overly specific pattern may exclude many sequences of interest.



We describe here the pattern-hit initiated BLAST (PHI-BLAST) program, whose hybrid strategy addresses a question frequently asked by researchers: namely, is a particular pattern seen in a protein of interest likely to be functionally relevant, or does it occur simply by chance? To address this question, we combine a pattern search with a search for statistically significant sequence similarity. These two approaches were combined previously in a program that explored the output of a BLAST search for conserved patterns (10). PHI-BLAST implements a reverse strategy which is computationally more efficient and which we believe will be of greater utility. Specifically, the similarity search is restricted to a subset of the sequence database comprised of the sequences that contain the given pattern.

PHI-BLAST takes as input a protein query sequence and a pattern that the query contains, specified as a regular expression; PROSITE patterns (12), for example, have this form. For each match between an instance of the pattern in the query sequence and an instance in a database sequence, PHI-BLAST constructs a high-scoring local alignment that includes the match. All resulting alignments are sorted by score and evaluated statistically.
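
To make the notion of a pattern instance concrete, the following is a minimal sketch that translates a small subset of PROSITE-style syntax into a Python regular expression and lists where a pattern occurs in a sequence. The function names and the example sequence are hypothetical, chosen only for illustration; this is not the PHI-BLAST implementation, and anchors and other less common PROSITE features are not handled.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern (subset) into a Python regex string.

    Supported elements: single residues, x (any residue), [..] (allowed set),
    {..} (forbidden set) and repeat counts such as x(4) or x(2,5).
    """
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                       # wildcard position
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # forbidden residues
        # '[ST]' and single letters are already valid regex fragments.
        if lo is not None:
            core += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(core)
    return "".join(parts)

def find_pattern_instances(sequence: str, pattern: str):
    """Return (start, end) half-open spans of pattern instances.

    A lookahead is used so that overlapping instances are reported; for
    variable-length patterns only the greedy match at each start is kept.
    """
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(1), m.end(1)) for m in regex.finditer(sequence)]

# Hypothetical toy sequence containing one P-loop-like instance.
print(find_pattern_instances("MLGPSGSGKSTLV", "[AG]-x(4)-G-K-[ST]"))
```

Each span found in the query, paired with each span found in a database sequence, would be one candidate seed for an alignment of the kind described above.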

This approach has greatest utility when it is suspected that a few residues comprising a small motif may be central to a biological function of interest. Showing that this pattern occurs within an extended and statistically significant alignment of the query sequence with one or more database sequences greatly decreases the likelihood that the pattern is spurious. Conversely, insisting on the presence of the pattern, and thereby searching a reduced sequence space, may aid the detection of subtle similarities that blend into the background noise in a regular BLAST search.

THE PHI-BLAST ALGORITHM 

To search for matches to a given pattern, we adapted a method of Baeza-Yates and Gonnet (13) and Wu and Manber (14). This method permits simple patterns to be represented in a single computer word and matches to be found very efficiently. When the pattern is relatively complex, for example consisting of many rigid parts and/or having wide ranges of spacer length, our program first seeks the part that is least likely to match by chance alone, and then performs local searches for the remaining pattern elements.
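
The bit-parallel idea behind this family of methods can be sketched as follows. This is an illustrative Shift-And matcher for a fixed-length pattern of residue classes, written under the assumption that every pattern position is a set of allowed residues; the helper names are hypothetical, and variable-length spacers (which the text says are handled by splitting the pattern into parts) are deliberately left out.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_masks(classes):
    """For each residue, set bit i if the residue is allowed at position i."""
    masks = {}
    for i, allowed in enumerate(classes):
        for residue in allowed:
            masks[residue] = masks.get(residue, 0) | (1 << i)
    return masks

def shift_and_search(text, classes):
    """Return the end positions (inclusive) of all matches in text.

    Bit i of `state` is set when the first i+1 pattern positions match the
    text ending at the current residue, so one shift and one AND update
    every partial match at once.
    """
    masks = build_masks(classes)
    accept = 1 << (len(classes) - 1)
    state = 0
    ends = []
    for pos, residue in enumerate(text):
        state = ((state << 1) | 1) & masks.get(residue, 0)
        if state & accept:
            ends.append(pos)
    return ends

# The fixed-length pattern [AG]-x(4)-G-K-[ST] as a list of residue classes.
p_loop = [set("AG")] + [set(AMINO_ACIDS)] * 4 + [set("G"), set("K"), set("ST")]
print(shift_and_search("MLGPSGSGKSTLV", p_loop))
```

In a language with fixed-width integers the whole state fits in one machine word whenever the pattern has no more positions than the word has bits, which is the efficiency the paragraph above refers to.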
For each instance of the input pattern in the query sequence, and each instance found in a database sequence, PHI-BLAST attempts to find the optimal local alignment that includes the pattern match. This can be done rigorously by applying dynamic programming separately to the parts preceding and the parts following the pattern. Each alignment is anchored at the corner of the path graph adjacent to the pattern match, and is permitted to end anywhere within the graph. However, to guarantee optimality, a very large portion of the path graph must be explored, which requires inordinate time in a database search. We therefore use the heuristic described in Altschul et al. (5) and Zhang et al. (15), in which only those cells of the path graph are considered whose score falls no more than X below the best score yet found. For sufficiently large values of the X parameter, this heuristic is very likely to find the optimal local alignment.
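
A simplified sketch of such an anchored extension with an X-drop cutoff is given below. It makes several assumptions that are not taken from the text: scoring uses a fixed match/mismatch reward and a linear gap penalty rather than a substitution matrix with affine gap costs, and the function name and default values are hypothetical. It is meant only to show how pruning by X trades exploration for speed.

```python
NEG_INF = float("-inf")

def xdrop_extend(a, b, x_drop, match=2, mismatch=-1, gap=-2):
    """Best score of an alignment of a and b that starts at the corner (0, 0)
    of the path graph and may end anywhere.

    Cells scoring more than x_drop below the best score found so far are
    discarded, so a small x_drop can miss the true optimum.
    """
    best = 0
    prev = [NEG_INF] * (len(b) + 1)
    prev[0] = 0
    # Row 0: alignments that begin with gaps opposite residues of b.
    for j in range(1, len(b) + 1):
        s = prev[j - 1] + gap
        if s < best - x_drop:
            break
        prev[j] = s
    for i in range(1, len(a) + 1):
        cur = [NEG_INF] * (len(b) + 1)
        alive = False
        for j in range(len(b) + 1):
            s = prev[j] + gap                      # a[i-1] opposite a gap
            if j > 0:
                sub = match if a[i - 1] == b[j - 1] else mismatch
                s = max(s, prev[j - 1] + sub,      # a[i-1] aligned to b[j-1]
                        cur[j - 1] + gap)          # b[j-1] opposite a gap
            if s < best - x_drop:
                continue                           # pruned cell stays -inf
            cur[j] = s
            alive = True
            best = max(best, s)
        if not alive:
            break                                  # whole row pruned: stop
        prev = cur
    return best

print(xdrop_extend("ACDEF", "ACDXF", x_drop=100))  # explores the full graph
print(xdrop_extend("ACDEF", "ACDXF", x_drop=0))    # stops at the mismatch
```

The part following the pattern can be extended directly in this way; the part preceding it can be handled with the same routine by reversing both prefixes so that the anchor again sits at the corner.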

Because PHI-BLAST performs a gapped extension for each instance of the pattern encountered in the database, reasonable execution times depend upon such instances being relatively rare.

STATISTICAL ANALYSIS

An alignment A produced by PHI-BLAST may be divided into three parts: the region A0 spanned by the pattern, and the local alignments A1 and A2 produced to either side of A0 by the gapped extension routine. Either or both of A1 and A2 may be empty. Correspondingly, the score S of the alignment A may be divided into the scores S0, S1 and S2. For the purpose of statistical analysis, it is easiest to assume that all alignment regions that satisfy the input pattern are equally plausible, and therefore to ignore S0. An alignment produced by PHI-BLAST is accordingly ranked by its reduced score S' = S1 + S2, and we ask how likely it is to obtain a reduced score S' >= x purely by chance.

In general, the input pattern is chosen because it is known to occur in the query sequence. The number of chance alignments to be expected depends upon the number of distinct pattern pairs that may seed a PHI-BLAST local alignment.

The simplest model of protein sequences is as random strings of independently chosen amino acids. Consider the optimal score of a local alignment constrained to start at a particular point P of the path graph. The much studied Smith-Waterman alignment score (1) is just this constrained score, maximized over all path graph points P. Because the Smith-Waterman score is known empirically to follow an extreme value distribution, the distribution of S1 should have an exponential tail, and the same holds for S2. This yields the estimate that the probability of a reduced score at least x is

$$\mathrm{Prob}(S' \ge x) \approx C(\lambda x + 1)\,e^{-\lambda x}$$
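
As a numerical illustration, the sketch below evaluates a tail of the form C(lambda*x + 1)exp(-lambda*x), which is what one obtains for the sum of two independent scores with exponential tails of the same decay rate. That functional form, the function names, the value chosen for C and the multiplication by a number of seeding pattern pairs are all assumptions made here for illustration; only lambda = 0.270 is a value quoted elsewhere in this text.

```python
import math

def reduced_score_tail(x, lam, c):
    """Assumed tail probability Prob(S' >= x) ~ c * (lam*x + 1) * exp(-lam*x).

    This is an asymptotic approximation: for small x and c > 1 it can
    exceed 1 and should not be read as a probability there.
    """
    return c * (lam * x + 1.0) * math.exp(-lam * x)

def expected_chance_alignments(x, lam, c, n_pairs):
    """Expected number of chance alignments with reduced score >= x,
    assuming n_pairs independent seeding pattern pairs (hypothetical model)."""
    return n_pairs * reduced_score_tail(x, lam, c)

lam = 0.270        # value quoted in the text
c = 1.0            # hypothetical placeholder, not the paper's estimate
for score in (20, 40, 60):
    print(score, expected_chance_alignments(score, lam, c, n_pairs=1000))
```

Because the exponential factor dominates, the expected count falls by roughly a factor of exp(lambda) for each additional unit of reduced score, which is why a modest gain in score can move an alignment from background noise to statistical significance.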

The statistical parameters employed here differ slightly from those published, and the parameter C is new and requires its own estimation. Random sequences for this purpose were generated using the amino acid frequencies of Robinson and Robinson.

IMPLEMENTATION AND EXAMPLES

To enhance the utility and functionality of a WWW version of PHI-BLAST, we have nested it between two other programs. While one may specify any pattern contained in the query sequence, a researcher often wishes to search with any well-characterized motifs the query may contain. To streamline this process, a program first searches the PROSITE database (12) with the query; any pattern found may then be used to launch a PHI-BLAST database search. To facilitate more detailed analysis, the PHI-BLAST results may be used automatically to serve as the basis for constructing a position-specific score matrix for searching via the position-specific iterated BLAST (PSI-BLAST) program (5). Like the other BLAST family programs, PHI-BLAST incorporates a pre-filter to mask regions of biased composition (low complexity) that often corrupt database searches.



PHI-BLAST may detect subtle relationships that escape standard database similarity searches, but this potential depends upon the specification of an appropriate pattern for the protein family of interest. We discuss four examples for which the original description depended critically upon detecting a subtle relationship. In each case, PHI-BLAST reports a subtle but structurally and functionally relevant relationship. Some of these relationships are not statistically significant on their own; however, because the PHI-BLAST output is ranked by E-value, they appear immediately after the clear family members, thereby prompting further analysis. In contrast, in the output of a standard BLAST search they are preceded by a number of alignments with smaller E-values. The examples discussed below are summarized in Table 1. All searches were of the Non-Redundant (NR) protein sequence database maintained by the NCBI.

Table 1. Detection of subtle protein sequence relationships using PHI-BLAST

The reported results are from searches of the NCBI non-redundant (NR) protein sequence database (April 1998; [...] sequences; 90 087 406 residues). E-values were calculated applying an edge-effect correction, with statistical parameters lambda = 0.270 and C = [...].

CED4-like cell death regulators

The Caenorhabditis elegans protein CED4 is a regulator of programmed cell death (apoptosis). CED4 contains the classical P-loop motif involved in phosphate binding, found in a great variety of ATPases and GTPases. ATP binding by CED4 has been demonstrated (31,32). In a gapped BLAST search, CED4 shows statistically significant sequence similarity to only one protein, the human apoptosis regulator Apaf-1, in which the P-loop motif is also present. In a PHI-BLAST search (Table 1), the best hit after Apaf-1, with an E-value of [...], is a plant disease resistance protein, Arabidopsis thaliana T7N9.18 (35). Further analysis shows that the apoptosis regulators and the putative plant disease resistance proteins share several conserved motifs, suggesting that they have a common origin and may have related roles in programmed cell death (L. Aravind, V.M. Dixit and E.V. Koonin, [...]).



HS90-type ATPase domains

We used PHI-BLAST to investigate the subtle similarity between the ATPase domains in the MutL DNA repair proteins and those in HS90 family proteins (36,37). The output identified proteins that contain the same type of predicted ATPase domain, but that in standard database searches do not show significant similarity to any known member of the superfamily. A PHI-BLAST search with a MutL protein (38) as query showed moderate similarity to the C.elegans protein ZC155.3 (39), which was originally described as having weak similarity to synaptocanalin. Subsequent database searches [...]. This finding is of considerable interest: the synaptocanalin domain apparently was fused to the worm protein by exon misassembly.

Archaeal tRNA nucleotidyltransferases

Archaeal tRNA nucleotidyltransferases form a distinct family of nucleic acid polymerases (43). In a PHI-BLAST search with Methanococcus jannaschii Cca (45) as query, the top hit outside the archaeal Cca family is a hypothetical protein belonging to a family of predicted [...] 2'-5'-adenylyltransferase from [...].



Table 2. Accuracy of PHI-BLAST statistics

PHI-BLAST searches were performed on shuffled and reversed versions of the NR database, using the query sequences and associated patterns of Table 1 and the same statistical parameters lambda and C. A, CED4-like cell death regulators; B, HS90-type ATPase domains; C, archaeal tRNA nucleotidyltransferases; D, archaeal homologs of DnaG-type DNA primases.

Archaeal homologs of DnaG-type DNA primases

The archaeal proteins in this example are homologs of bacterial DNA primases (47), but do not show significant similarity to these proteins in standard searches. The relationships uncovered in this example are undetectable with standard database search techniques.



PERFORMANCE EVALUATION

To test the accuracy of the PHI-BLAST statistics, we performed searches of shuffled and reversed versions of the NR database (Table 2). The results suggest that under an alternative random protein model, based upon real sequences, the statistics are slightly conservative.

To compare the speed of PHI-BLAST to that of a standard gapped BLAST program (5), we timed both for searches of each of the four examples above (Table 3). The time required to scan the database for pattern matches is a major determinant of PHI-BLAST's speed. Execution speed relative to gapped BLAST therefore differs between relatively informative patterns and relatively weak ones.



Table 3. Execution speed of PHI-BLAST

PHI-BLAST was compared with gapped BLAST version 2.0.4. Both programs employed [...] parameters. This timing experiment was run on a Sun Ultra [...] running the operating system Solaris.



CONCLUSION



As illustrated by the biological examples discussed above, PHI-BLAST helps both to ascertain the biological relevance of patterns detected within protein sequences and, in some cases, to detect subtle relationships. Patterns, however, are not designed specifically to maximize search sensitivity. Within proteins, residues that are absolutely conserved are few, and specifying a restricted set of possibilities at a position often excludes many members of a class of related proteins. PHI-BLAST is a single-pass search method, and its results may usefully be followed up with PSI-BLAST or other tools.

We have developed PHI-BLAST for the comparison of a protein query with a protein database. A version that translates a DNA database [...] should be particularly valuable, and a DNA-DNA version of the program should also find use. We also plan to extend PHI-BLAST so that it may use generalized affine gap costs in place of the traditional affine gap costs currently permitted.

Note 

Source code for PHI-BLAST is available by anonymous ftp [...] in the directory 'blast', and the program may be run from NCBI [...].